Red Wine Data Analysis by Monica London

The purpose of this analysis is to explore a dataset featuring characteristics about red wine.

Summary

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Structure

str(RedWine)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The red wine data set contains 1599 observations of 13 variables.

Univariate Plots Section

We can see the distribution of quality ratings has a minimum of 3 and a maximum of 8, with most ratings at 5 or 6. Surprisingly, there are no ratings of 1, 2, 9, or 10. I would have expected a larger range of quality ratings with such a large data set.

I divided the data by quality level. Low: Quality of 0-3, Medium: Quality of 4-6, High: Quality of 7-10. We can see that the vast majority of observations fall in the medium quality level.

We can see that the Density and pH plots are the most normally distributed. The majority of pH levels fall between 3.0 and 3.5. Many of the plots are skewed to the right, including Free Sulfur Dioxide and Total Sulfur Dioxide. The majority of wines have less than 100 in total sulfur dioxide. Several of the plots are long tailed, such as Residual Sugar and Chlorides.

The above plots compare the variables before and after transformation. The data for residual sugar, chlorides, free sulfur dioxide, and total sulfur dioxide become more normally distributed after applying log10.

Univariate Analysis

What is the structure of your dataset?

There are 1599 wines in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality). All of the variables are numeric with the exception of quality, which is stored as an integer.

Other observations:

Most of the wines have a quality of 5 or 6.

The 3rd quartile of residual sugar levels is 2.6, although there are a few major outliers, with the maximum residual sugar level of 15.5. I’m interested to see if higher residual sugar wines tend to have lower or higher quality.

Most wines have an alcohol content of less than 12%. This surprises me, given that the majority of red wines I’m familiar with have alcohol contents above 13.5%.

Many of the wines have 0 citric acid.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is quality, and I’d like to determine which variables impact quality ratings the most. I suspect alcohol, residual sugar, and pH contribute to quality ratings, as they seem to be features you may be able to detect during wine tastings.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

From research into what contributes to the taste of wine, I discovered that sweetness, acidity, tannin, alcohol, and body are the main features. In addition to pH, I think fixed acidity and volatile acidity may contribute to the acidity of wine.

Did you create any new variables from existing variables in the dataset?

Yes, I created a new variable called quality level, which cuts the quality levels into low (3, 4), medium (5, 6), and high (7,8).
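A minimal sketch of how such a variable could be built with cut(), assuming the grouping stated in this answer (3-4, 5-6, 7-8). The quality vector and the name quality.level are hypothetical stand-ins, not the original code.

```r
# Hypothetical quality scores; the real analysis would use RedWine$quality.
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Intervals are right-closed by default:
# (2,4] -> low, (4,6] -> medium, (6,8] -> high.
quality.level <- cut(quality,
                     breaks = c(2, 4, 6, 8),
                     labels = c("low", "medium", "high"))

table(quality.level)
```

Because cut() returns an ordered set of factor levels, boxplots grouped by this variable appear in low, medium, high order without extra sorting.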

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I deleted column X because it was simply a repeat of the index.
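One way this step could look, sketched on a small stand-in data frame (the name wine is hypothetical; the alcohol values are the first five shown in the str() output). The check confirms X really is just the row number before it is dropped.

```r
# Small stand-in data frame; X mirrors the row index as in the real file.
wine <- data.frame(X = 1:5, alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4))

# Confirm X carries no information beyond the row number.
stopifnot(all(wine$X == seq_len(nrow(wine))))

# Assigning NULL removes the column in place.
wine$X <- NULL
names(wine)
```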

I applied log10 to residual sugar, chlorides, free sulfur dioxide, and total sulfur dioxide in order to normalize the distributions.
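The effect of the transformation can be illustrated on simulated data. This is a sketch under assumptions: the values below are drawn from a log-normal distribution as a stand-in for a right-skewed variable such as residual sugar, and the skewness helper is defined here for illustration rather than taken from a package.

```r
set.seed(42)
# Simulated right-skewed values; not the wine data.
x <- rlnorm(1000, meanlog = 1, sdlog = 0.6)

# Sample skewness: third standardized moment (near 0 for a symmetric sample).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew.raw <- skewness(x)
skew.log <- skewness(log10(x))

c(raw = skew.raw, log10 = skew.log)
```

Note that log10 is only defined for positive values. The summary shows minimums above zero for all four transformed variables, whereas citric acid has a minimum of 0 and would produce -Inf.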

Bivariate Plots Section

I analyzed the following bivariate relationships:

Quality vs. Alcohol

Quality vs. pH

Quality vs. Residual Sugar

Quality vs. Fixed Acidity

Quality vs. Volatile Acidity

Residual Sugar vs. Alcohol

Residual Sugar vs. pH

pH vs. Alcohol

Fixed Acidity vs. Density

Fixed Acidity vs. pH

pH vs. Citric Acid

Quality Level vs. Alcohol

Quality Level vs. pH

Quality Level vs. Residual Sugar

The correlogram indicates that most pairs of variables are not highly correlated. The values reported here are correlation coefficients (r), which can be negative, rather than r^2 values. The strongest relationships appear to be density vs. fixed acidity (r = 0.67), citric acid vs. fixed acidity (r = 0.67), pH vs. fixed acidity (r = -0.68), and total sulfur dioxide vs. free sulfur dioxide (r = 0.67). A correlation between citric acid and fixed acidity is not surprising as they are both acids. Free sulfur dioxide is part of total sulfur dioxide, so a correlation is expected. pH is a measure of acidity, so the correlation between pH and fixed acidity is not surprising either. I am unsure what would cause a correlation between density and fixed acidity, but it could be that a more acidic liquid is denser than a less acidic one.
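A correlogram reports pairwise correlation coefficients, which carry a sign, while the R-squared of a one-predictor linear model is the square of that coefficient. The sketch below shows the link on simulated data (the toy data frame and its coefficients are assumptions, not estimates from the wine data). The same relationship is visible in this report: the alcohol vs. quality correlation of 0.48 squares to about 0.23, in line with the R-squared of 0.227 for model m1.

```r
set.seed(1)
# Simulated stand-ins: pH falls as fixed acidity rises, plus noise.
acidity <- rnorm(200, mean = 8.3, sd = 1.7)
ph <- 3.9 - 0.07 * acidity + rnorm(200, sd = 0.12)
toy <- data.frame(fixed.acidity = acidity, pH = ph)

# Pearson correlation coefficient: signed, between -1 and 1.
r <- cor(toy$fixed.acidity, toy$pH)

# R-squared of the one-predictor model: between 0 and 1.
fit <- lm(pH ~ fixed.acidity, data = toy)
r.squared <- summary(fit)$r.squared

c(r = r, r.squared = r.squared)
```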

This plot analyzes alcohol content across qualities. Red points represent mean alcohol content by quality and blue points represent outliers. The plot confirms a moderate correlation (r = 0.48). Alcohol content seems to increase rapidly between the moderate and high quality wines.

This plot analyzes pH across qualities. Red points represent the mean pH by quality and blue points represent outliers. The quality vs. pH plot does not seem to indicate any trend or correlation, which is consistent with the correlation coefficient of 0.06.

This plot analyzes residual sugar on a log10 scale across qualities. Red points represent mean residual sugar by quality and blue points represent outliers. The boxplot of residual sugar across qualities does not indicate a correlation between residual sugar and quality, which is consistent with the correlation coefficient of 0.01.

This plot visualizes fixed acidity across qualities. Red points represent the mean fixed acidity by quality and blue points represent outliers. The boxplot of fixed acidity across qualities shows a very weak positive correlation, consistent with the low correlation coefficient of 0.12.

The quality vs. volatile acidity plot shows a moderate negative correlation. This is confirmed by the correlation coefficient of -0.39.

The residual sugar by alcohol plot does not show a clear trend, which is consistent with the low correlation coefficient of 0.04. This is surprising given that lower alcohol wines tend to taste sweeter, leading one to believe that they contain more sugar.

This plot visualizes residual sugar on a log10 scale by pH. There doesn’t appear to be a notable correlation between pH and residual sugar, which is supported by the correlation coefficient of -0.09.

The pH by alcohol content plot indicates a slight positive correlation, which is confirmed by the correlation coefficient of 0.21.

The density by fixed acidity plot indicates a strong positive correlation. The smoother highlights this trend. The correlation coefficient of 0.67 is one of the strongest between any pair of variables in the data set.

The pH by fixed acidity plot shows a strong negative correlation. The smoother and the correlation coefficient of -0.68 confirm this trend. This trend is not surprising given that pH is a measure of acidity and a lower pH indicates higher acidity.

This boxplot visualizes alcohol content by quality level. Outliers are plotted in red. It is interesting that the alcohol content tends to be much higher in the higher quality wines than the medium or low quality wines.

This boxplot visualizes the pH level by quality level. Outliers are plotted in red. The median pH level decreases as the quality increases. The range of the data is narrower at the highest quality level.

This boxplot represents residual sugar by quality level. Outliers are plotted in red. The range of outliers is large in this plot, especially in the medium quality level. All of the outliers in all quality levels are high outliers; they have very high levels of residual sugar rather than very low levels of residual sugar.

This boxplot visualizes volatile acidity by quality level. Outliers are plotted in red. There is a clear relationship between quality level and volatile acidity. Both the IQR and median volatile acidity decrease as quality level increases.
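The median and IQR that a grouped boxplot displays can also be computed directly, which makes trends like the one above easier to verify. A minimal sketch, assuming a hypothetical data frame toy with made-up volatile acidity values (not taken from the wine data):

```r
# Hypothetical values; the real inputs would be RedWine$volatile.acidity
# and the derived quality level.
toy <- data.frame(
  level = factor(rep(c("low", "medium", "high"), each = 4),
                 levels = c("low", "medium", "high")),
  volatile.acidity = c(0.60, 0.70, 0.90, 1.00,
                       0.40, 0.50, 0.60, 0.70,
                       0.30, 0.35, 0.40, 0.45)
)

# One summary value per quality level, in factor-level order.
med <- tapply(toy$volatile.acidity, toy$level, median)
spread <- tapply(toy$volatile.acidity, toy$level, IQR)

rbind(median = med, IQR = spread)
```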

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The main feature of interest in this analysis is quality, and whether any features show an effect on quality. The correlogram shows that the feature most strongly correlated with quality is alcohol (r = 0.48). The alcohol content by quality boxplot highlights this trend. The alcohol content by quality level boxplot shows this relationship even better, with a clear increase in median alcohol levels in the highest quality wines.

The relationship between quality and pH has a correlation coefficient of 0.06, which indicates practically zero correlation, and the corresponding boxplot confirms this. However, when pH is compared to quality levels, there is a pattern in the boxplot. The median pH levels seem to decrease as the quality level increases, especially between the lower quality and medium quality wines.

The correlogram indicates that there is no correlation (r = 0.01) between residual sugar and quality. Even after transforming the residual sugar data using log10 and plotting it against quality levels, there seems to be no clear relationship between residual sugar and quality.

One of the most notable bivariate relationships discovered was the relationship between quality level and volatile acidity. The correlogram shows a correlation coefficient of -0.39 between quality and volatile acidity. The volatile acidity by quality boxplot visualizes how volatile acidity decreases as quality increases. This trend is highlighted further when quality is grouped into levels.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Some of the most interesting relationships were between variables that were not the main feature of interest. In fact, three of the four strongest correlations involved fixed acidity and another variable. Fixed acidity had the strongest correlations with density, citric acid, and pH. As discussed previously, this is not surprising given that many of the variables are either acids themselves or a measure of acidity.

What was the strongest relationship you found?

The strongest relationship, according to the correlation coefficient, is between pH and fixed acidity (r = -0.68). However, once the data was cut into quality levels, the plots indicate that there are strong relationships between quality level and alcohol, quality level and volatile acidity, and quality level and pH.

Multivariate Plots Section

Because the majority of the data has a medium quality level, the data is highly clustered. The smoother shows a slightly higher fixed acidity vs. density ratio for higher quality level wines vs. medium or lower quality level wines.

This plot doesn’t show strong trends, but it does show how the majority of the data falls in the lower alcohol, lower pH quadrant of the chart.

This plot shows some differences in the relationship between pH and citric acid by quality level. The quality level of wine appears to correlate with the level of citric acid for any given pH.

The relationship between pH and fixed acidity seems to be uniform across the high and medium quality wines. The low quality wines follow a less linear pattern, but this may be attributed to the limited low quality data points.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = RedWine)
## m2: lm(formula = quality ~ alcohol + pH, data = RedWine)
## m3: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)), 
##     data = RedWine)
## m4: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) + 
##     residual.sugar, data = RedWine)
## m5: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) + 
##     residual.sugar + fixed.acidity, data = RedWine)
## m6: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) + 
##     residual.sugar + fixed.acidity + volatile.acidity, data = RedWine)
## m7: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) + 
##     residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)), 
##     data = RedWine)
## m8: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) + 
##     residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) + 
##     total.sulfur.dioxide, data = RedWine)
## m9: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) + 
##     residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) + 
##     total.sulfur.dioxide + citric.acid, data = RedWine)
## m10: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) + 
##     residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) + 
##     total.sulfur.dioxide + citric.acid + I(log10(chlorides)), 
##     data = RedWine)
## m11: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) + 
##     residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) + 
##     total.sulfur.dioxide + citric.acid + I(log10(chlorides)) + 
##     I(log10(free.sulfur.dioxide)), data = RedWine)
## m12: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) + 
##     residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) + 
##     total.sulfur.dioxide + citric.acid + I(log10(chlorides)) + 
##     I(log10(free.sulfur.dioxide)) + sulphates, data = RedWine)
## 
## ==========================================================================================================================================================================================================
##                                        m1            m2            m3            m4            m5            m6            m7            m8            m9           m10           m11           m12       
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)                         1.875***      4.426***      4.526***      4.539***      3.014***      3.559***      3.971***      3.834***      3.901***      3.871***      4.036***      3.359***  
##                                      (0.175)       (0.387)       (0.393)       (0.393)       (0.601)       (0.576)       (0.607)       (0.605)       (0.606)       (0.605)       (0.610)       (0.606)    
##   alcohol                             0.361***      0.386***      0.389***      0.391***      0.386***      0.330***      0.321***      0.325***      0.330***      0.321***      0.315***      0.292***  
##                                      (0.017)       (0.017)       (0.017)       (0.017)       (0.017)       (0.017)       (0.017)       (0.017)       (0.017)       (0.018)       (0.018)       (0.018)    
##   pH                                               -0.850***     -0.870***     -0.872***     -0.506**      -0.259        -0.288        -0.426**      -0.456**      -0.513**      -0.524**      -0.487**   
##                                                    (0.116)       (0.116)       (0.116)       (0.159)       (0.154)       (0.154)       (0.157)       (0.158)       (0.161)       (0.160)       (0.158)    
##   I(log10(residual.sugar))                                       -0.171        -0.495        -0.728*       -0.199        -0.120        -0.142        -0.127        -0.084        -0.031         0.057     
##                                                                  (0.114)       (0.313)       (0.320)       (0.309)       (0.311)       (0.309)       (0.309)       (0.310)       (0.310)       (0.305)    
##   residual.sugar                                                                0.038         0.059         0.012         0.009         0.020         0.020         0.018         0.011         0.009     
##                                                                                (0.034)       (0.035)       (0.033)       (0.033)       (0.033)       (0.033)       (0.033)       (0.033)       (0.033)    
##   fixed.acidity                                                                               0.047***      0.023         0.018         0.009         0.022         0.018         0.018         0.017     
##                                                                                              (0.014)       (0.014)       (0.014)       (0.014)       (0.016)       (0.016)       (0.016)       (0.016)    
##   volatile.acidity                                                                                         -1.249***     -1.255***     -1.233***     -1.332***     -1.282***     -1.253***     -1.114***  
##                                                                                                            (0.101)       (0.101)       (0.100)       (0.117)       (0.120)       (0.120)       (0.120)    
##   I(log10(total.sulfur.dioxide))                                                                                         -0.124*        0.424**       0.405**       0.429**       0.210         0.103     
##                                                                                                                          (0.058)       (0.144)       (0.145)       (0.145)       (0.178)       (0.175)    
##   total.sulfur.dioxide                                                                                                                 -0.006***     -0.005***     -0.006***     -0.005***     -0.004**   
##                                                                                                                                        (0.001)       (0.001)       (0.001)       (0.001)       (0.001)    
##   citric.acid                                                                                                                                        -0.234        -0.170        -0.134        -0.226     
##                                                                                                                                                      (0.142)       (0.146)       (0.147)       (0.145)    
##   I(log10(chlorides))                                                                                                                                              -0.248        -0.244        -0.552***  
##                                                                                                                                                                    (0.132)       (0.131)       (0.135)    
##   I(log10(free.sulfur.dioxide))                                                                                                                                                   0.203*        0.198*    
##                                                                                                                                                                                  (0.095)       (0.094)    
##   sulphates                                                                                                                                                                                     0.813***  
##                                                                                                                                                                                                (0.108)    
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                           0.227         0.252         0.253         0.254         0.259         0.324         0.326         0.333         0.334         0.336         0.338         0.361     
##   adj. R-squared                      0.226         0.251         0.252         0.252         0.257         0.322         0.323         0.330         0.331         0.332         0.333         0.356     
##   sigma                               0.710         0.699         0.699         0.699         0.696         0.665         0.664         0.661         0.661         0.660         0.659         0.648     
##   F                                 468.267       268.888       180.161       135.446       111.286       127.233       109.959        99.324        88.685        80.298        73.574        74.568     
##   p                                   0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood                  -1721.057     -1694.466     -1693.325     -1692.710     -1687.117     -1613.455     -1611.152     -1602.601     -1601.237     -1599.456     -1597.172     -1568.959     
##   Deviance                          805.870       779.508       778.397       777.799       772.376       704.393       702.367       694.895       693.711       692.167       690.192       666.261     
##   AIC                              3448.114      3396.931      3396.650      3397.421      3388.234      3242.909      3240.303      3225.203      3224.475      3222.913      3220.344      3165.918     
##   BIC                              3464.245      3418.440      3423.536      3429.684      3425.874      3285.926      3288.697      3278.974      3283.623      3287.438      3290.247      3241.198     
##   N                                1599          1599          1599          1599          1599          1599          1599          1599          1599          1599          1599          1599         
## ==========================================================================================================================================================================================================
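The table reads more easily with one property of nested models in mind: R-squared can only stay the same or rise as predictors are added, so adjusted R-squared, AIC, and BIC are better guides to whether a new term earns its place. In the table above, for example, BIC rises from m6 (3285.926) to m7 (3288.697) even though R-squared rises from 0.324 to 0.326. The sketch below reproduces that behaviour on simulated data; the toy data frame, its coefficients, and the deliberately irrelevant noise predictor are all assumptions for illustration, not the wine data.

```r
set.seed(7)
n <- 500
# Simulated predictors and response; coefficients are arbitrary.
toy <- data.frame(alcohol = rnorm(n, 10.4, 1.1),
                  volatile.acidity = rnorm(n, 0.53, 0.18),
                  noise = rnorm(n))
toy$quality <- 2 + 0.35 * toy$alcohol - 1.2 * toy$volatile.acidity +
  rnorm(n, sd = 0.65)

# Build nested models one term at a time.
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + noise)   # an irrelevant predictor

fits <- list(m1 = m1, m2 = m2, m3 = m3)
comparison <- data.frame(
  r.squared     = sapply(fits, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(fits, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(fits, AIC),
  BIC           = sapply(fits, BIC)
)
round(comparison, 3)
```

For a formal test of whether an added term improves the fit, anova(m1, m2, m3) compares the nested models with F-tests.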

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

When looking at pH vs. citric acid in terms of quality levels, there does seem to be a relationship. For any given pH, the citric acid level appears to increase as the quality level increases. This relationship is less clear once pH rises above 3.5, possibly because there are fewer data points at that level.

In the plot of density vs. fixed acidity by quality level, there is a very strong relationship. For a given fixed acidity level below 12, the average density is higher for lower quality wines than for higher quality wines.

Were there any interesting or surprising interactions between features?

In the plot of density vs. fixed acidity by quality level, the smoother for the lowest quality wines does not appear linear. It almost appears logarithmic, with density rising more slowly as fixed acidity increases.

Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I created a linear model to predict quality from several variables, including alcohol, pH, residual sugar, fixed acidity, volatile acidity, total sulfur dioxide, citric acid, chlorides, free sulfur dioxide, and sulphates. The R-squared value of the full model (m12) is 0.361, meaning it explains roughly 36% of the variance in quality. Because quality ratings are chosen by humans and are not scientific, an R-squared value of 0.361 is relatively strong.

The number of observations with quality ratings of 3, 4, 7, and 8 is much lower than the number with ratings of 5 and 6. A bigger overall data set and more observations in the lower and higher quality ratings would improve the model.
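The imbalance can be quantified with a frequency table. The sketch below uses only the first ten quality values shown in the str() output, which are far too few to be representative; on the full data the same two calls would be applied to RedWine$quality.

```r
# First ten quality values from the str() output, used only to keep
# the example self-contained.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

counts <- table(quality)                 # observations per rating
shares <- round(prop.table(counts), 2)   # share of the total per rating

counts
shares
```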

Final Plots and Summary

Plot One

Description One

This plot visualizes alcohol content by quality level. Red points represent outliers and blue points represent the mean alcohol content by quality level. We can see a similar mean alcohol content for both low and medium quality wines. Interestingly, the mean alcohol content spikes much higher for the highest quality wines.

Plot Two

Description Two

This plot visualizes volatile acidity by quality level. Red points represent outliers and blue points represent the mean volatile acidity by quality level. We can see a very strong negative relationship between quality level and volatile acidity. As the quality level increases, both the mean volatile acidity and the IQR decrease.

Plot Three

Description Three

This plot is notable because it visualizes the relationship between the two variables with the strongest correlation in the data set. Fixed acidity and pH have a correlation coefficient of -0.68. This relationship makes a lot of sense because pH is a measure of acidity; a lower pH indicates a substance is more acidic and a higher pH indicates a substance is more basic. This plot confirms this relationship.

Reflection

The goal of this exploratory data analysis was to determine which variables most impacted quality. There were several insights I found during the exploration of this data set.

Alcohol and Quality: There is a clear relationship between quality level and alcohol content, but only for the highest quality wines.

pH and Quality: There is a negative relationship between pH and quality levels. This is unclear until quality is separated into quality levels.

Volatile Acidity and Quality: There is a clear negative relationship between volatile acidity and quality.

The correlogram was very helpful in showing correlations among the chemical variables, but less so for quality. In several cases, the relationship between quality and a specific variable was unclear until quality was separated into quality levels.

It would be useful to know the type of red wine, such as Cabernet Sauvignon, Pinot Noir, Merlot, etc. It is very difficult to analyze trends when the type of the wine is unknown. For example, a certain wine type may be expected to have more alcohol and therefore someone rating the quality of that wine would rate it more favorably than someone rating a quality of wine that was expected to have a lower alcohol content. Furthermore, it would be interesting to analyze wines from different parts of the world to see if there is a relationship between quality or any of the variables and region.